04. TD Control: Sarsa
Monte Carlo (MC) control methods require us to complete an entire episode of interaction before updating the Q-table. Temporal-Difference (TD) methods instead update the Q-table after every time step.
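One way to see the contrast in update-rule terms: constant-$\alpha$ MC control must wait until the episode ends to compute the return $G_t$ before applying $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( G_t - Q(S_t, A_t) \right)$, whereas a TD method replaces $G_t$ with an estimate it can form immediately from the very next step of experience.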
## Video
TD Control Sarsa Part 1
Watch the next video to learn about Sarsa (or Sarsa(0)), one method for TD control.
## Video
TD Control Sarsa Part 2
## Pseudocode

In the algorithm, the number of episodes the agent collects is equal to `num_episodes`. For every time step $t \geq 0$, the agent:
- takes the action $A_t$ (from the current state $S_t$) that is $\epsilon$-greedy with respect to the Q-table,
- receives the reward $R_{t+1}$ and next state $S_{t+1}$,
- chooses the next action $A_{t+1}$ (from the next state $S_{t+1}$) that is $\epsilon$-greedy with respect to the Q-table,
- uses the information in the tuple ($S_t$, $A_t$, $R_{t+1}$, $S_{t+1}$, $A_{t+1}$) to update the entry $Q(S_t, A_t)$ in the Q-table corresponding to the current state $S_t$ and the action $A_t$, as in the update rule and sketch below.
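This tuple of experience gives the method its name: Sarsa is short for State-Action-Reward-State-Action. The update applied at each step is the one-step Sarsa rule

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)$$

where $\alpha$ is the step-size parameter and $\gamma$ is the discount rate.

Below is a minimal Python sketch of the loop just described. It assumes an environment exposing the classic OpenAI Gym interface (`env.reset()` returns a state, and `env.step(action)` returns `(next_state, reward, done, info)`); the decaying schedule `eps = 1 / i_episode` is just one common choice of exploration schedule, not part of the algorithm itself.

```python
import numpy as np
from collections import defaultdict

def epsilon_greedy(Q, state, nA, eps):
    """Select an action epsilon-greedily with respect to Q[state]."""
    if np.random.random() > eps:
        return int(np.argmax(Q[state]))    # exploit: greedy action
    return np.random.randint(nA)           # explore: uniform random action

def sarsa(env, num_episodes, alpha, gamma=1.0):
    """One-step Sarsa control (a sketch; assumes the classic Gym API)."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))  # Q-table, zero-initialized
    for i_episode in range(1, num_episodes + 1):
        eps = 1.0 / i_episode              # decaying epsilon (one common schedule)
        state = env.reset()
        action = epsilon_greedy(Q, state, nA, eps)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            if not done:
                next_action = epsilon_greedy(Q, next_state, nA, eps)
                # Sarsa update from the tuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})
                Q[state][action] += alpha * (reward + gamma * Q[next_state][next_action]
                                             - Q[state][action])
                state, action = next_state, next_action
            else:
                # the Q-value of any terminal state is zero by convention
                Q[state][action] += alpha * (reward - Q[state][action])
    return Q
```

Note that the action in the update target, $A_{t+1}$, is the action the agent actually takes next. This is what makes Sarsa an on-policy method: it evaluates and improves the same $\epsilon$-greedy policy that it uses to collect experience.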